Algorithm 10 Search process of DCP-NAS
Input: Training data, validation data
Parameter: Searching hyper-graph $G$, $M = 8$, $e(o_m^{(i,j)}) = 0$ for all edges
Output: Optimized $\hat{\alpha}^*$.
1: while DCP-NAS do
2:   while Training real-valued Parent do
3:     Search a temporary real-valued architecture $p(w, \alpha)$.
4:     Decoupled optimization from Eqs. 4.43 to 4.53.
5:     Generate the tangent direction $\frac{\partial \tilde{f}(w,\alpha)}{\partial \alpha}$ from Eqs. 4.21 to 4.29.
6:   end while
7:   Architecture inheriting $\hat{\alpha} \leftarrow \alpha$.
8:   while Training 1-bit Child do
9:     Calculate the learning objective from Eqs. 4.26 to 4.32.
10:    Tangent propagation from Eqs. 4.33 to 4.41 and decoupled optimization from Eqs. 4.43 to 4.53.
11:    Obtain $\hat{p}(\hat{w}, \hat{\alpha})$.
12:  end while
13:  Architecture inheriting $\alpha \leftarrow \hat{\alpha}$.
14: end while
15: return Optimized architecture $\hat{\alpha}^*$.
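To make the control flow of Algorithm 10 concrete, the following minimal Python sketch reproduces its alternating Parent/Child loop. The helper functions (`parent_step`, `tangent_direction`, `child_step`) are hypothetical placeholders that stand in for the cited equations, not the actual DCP-NAS implementation.

```python
import numpy as np

# Placeholders standing in for the cited equations; they are illustrative only.
def parent_step(w, alpha):                   # decoupled optimization, Eqs. 4.43-4.53
    return w, alpha

def tangent_direction(w, alpha):             # tangent direction of f~(w, alpha), Eqs. 4.21-4.29
    return np.zeros_like(alpha)

def child_step(w_hat, alpha_hat, tangent):   # Child objective + tangent propagation, Eqs. 4.26-4.41
    return w_hat, alpha_hat

def dcp_nas_search(w, w_hat, alpha, rounds=10, parent_iters=5, child_iters=5):
    """Alternating Parent/Child search with architecture inheritance (Algorithm 10)."""
    for _ in range(rounds):
        for _ in range(parent_iters):        # train the real-valued Parent
            w, alpha = parent_step(w, alpha)
            tangent = tangent_direction(w, alpha)
        alpha_hat = alpha.copy()             # architecture inheriting: alpha_hat <- alpha
        for _ in range(child_iters):         # train the 1-bit Child
            w_hat, alpha_hat = child_step(w_hat, alpha_hat, tangent)
        alpha = alpha_hat.copy()             # architecture inheriting: alpha <- alpha_hat
    return alpha                             # optimized architecture alpha_hat*

# Toy usage; the weights are omitted because the placeholder steps ignore them.
alpha_star = dcp_nas_search(w=None, w_hat=None, alpha=np.random.rand(4, 8))
```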
where $\circledast$ represents the Hadamard product and $\eta = \eta_1\eta_2$. We take
$$\psi_t = -\Big[\sum_{m=1}^{M}\sum_{e'=1}^{E}\tilde{g}_{e'}\,\frac{\partial w_m}{\partial \alpha_{e',m}},\;\cdots,\;\sum_{m=1}^{M}\sum_{e'=1}^{E}\tilde{g}_{e'}\,\frac{\partial w_m}{\partial \alpha_{e',m}}\Big]^T.$$
Note that $\frac{\partial w}{\partial \alpha}$ is unsolvable and has no explicit form in NAS, which makes $\psi_t$ unsolvable as well. Thus we introduce a learnable parameter $\tilde{\psi}_t$ to approximate $\psi_t$, whose back-propagation process is calculated as
$$\tilde{\psi}_{t+1} = \Big|\tilde{\psi}_t - \eta_{\psi}\,\frac{\partial L}{\partial \tilde{\psi}_t}\Big|. \qquad (4.51)$$
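As a concrete illustration of Eq. 4.51, the following minimal numpy sketch performs one update of the learnable parameter $\tilde{\psi}_t$; the learning rate `eta_psi`, the vector shape, and the toy gradient are illustrative assumptions rather than values from the text.

```python
import numpy as np

def update_psi(psi_tilde, grad_psi, eta_psi=1e-3):
    """One step of Eq. 4.51: psi~_{t+1} = | psi~_t - eta_psi * dL/dpsi~_t |.

    The absolute value keeps the learnable surrogate non-negative."""
    return np.abs(psi_tilde - eta_psi * grad_psi)

# Toy usage; the vector shape is chosen arbitrarily for illustration.
psi_tilde = np.full(8, 0.1)
grad_psi = np.random.randn(8) * 0.01   # placeholder for dL/dpsi~_t from back-propagation
psi_tilde = update_psi(psi_tilde, grad_psi)
```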
Eq. 4.50 shows that our method is based on a projection function that solves the coupling problem in the optimization via the learnable parameter $\tilde{\psi}_t$. In this method, we consider the influence of $\alpha^t$ and backtrack the optimized state in the $(t+1)$-th step to form $\tilde{\alpha}^{t+1}$. However, the key point in the optimization is where and when the backtracking should be applied. Thus, we define the update rule as
$$\tilde{\alpha}^{t+1}_{:,m} =
\begin{cases}
P(\alpha^{t+1}_{:,m}, \alpha^{t}_{:,m}), & \text{if } \mathrm{ranking}(R(w_m)) > \tau,\\
\alpha^{t+1}_{:,m}, & \text{otherwise},
\end{cases} \qquad (4.52)$$
where $P(\alpha^{t+1}_{:,m}, \alpha^{t}_{:,m}) = \alpha^{t+1}_{:,m} + \eta\,\tilde{\psi}_t \circledast \alpha^{t}_{:,m}$ and the subscript $_{:,m}$ denotes a specific edge.
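For illustration only, the following numpy sketch applies the backtracking rule of Eq. 4.52 edge by edge; it uses the norm constraint $R(w_m)$ and the threshold $\tau$ defined next in Eqs. 4.53 and 4.54. The array shapes, the ascending ranking convention, and the default $\epsilon$ are assumptions.

```python
import numpy as np

def backtrack_alpha(alpha_next, alpha_prev, weights, psi_tilde, eta, eps=0.5):
    """Selective backtracking of Eq. 4.52.

    alpha_next, alpha_prev: architecture parameters at steps t+1 and t, shape (K, M)
    weights: list of M weight tensors, one per edge
    psi_tilde: learnable surrogate from Eq. 4.51 (assumed to broadcast against a column of alpha)
    eps: hyperparameter controlling the threshold tau = floor(eps * M) (Eq. 4.54)
    """
    M = alpha_next.shape[1]
    tau = int(np.floor(eps * M))                               # Eq. 4.54
    norms = np.array([np.sum(w_m ** 2) for w_m in weights])    # R(w_m) = ||w_m||_2^2, Eq. 4.53
    ranking = np.argsort(np.argsort(norms))                    # ascending rank of each edge's norm (assumed convention)

    alpha_out = alpha_next.copy()
    for m in range(M):
        if ranking[m] > tau:                                   # backtrack this edge
            # P(alpha^{t+1}, alpha^t) = alpha^{t+1} + eta * psi~_t (Hadamard) alpha^t
            alpha_out[:, m] = alpha_next[:, m] + eta * psi_tilde * alpha_prev[:, m]
    return alpha_out
```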
$R(w_m)$ denotes the norm constraint of $w_m$ and is further defined as
$$R(w_m) = \|w_m\|_2^2, \quad \forall m = 1, \cdots, M, \qquad (4.53)$$
where $\tau$ denotes the threshold for deciding whether or not to backtrack. We further define the threshold as follows:
$$\tau = \lfloor \epsilon \cdot M \rfloor, \qquad (4.54)$$
where $\epsilon$ denotes a hyperparameter that controls the percentage of edges to backtrack. By backtracking $\alpha$, the supernet can learn to jump out of local minima. The general process of DCP-NAS is described in Algorithm 10. Note that the decoupled optimization can be